This paper develops a theory for group Lasso using a concept called strong group sparsity. Our result shows that group Lasso is superior to standard Lasso for strongly group-sparse signals. This provides a convincing theoretical justification for using group sparse regularization when the underlying group structure is consistent with the data. Moreover, the theory predicts some limitations of the group Lasso formulation that are confirmed by simulation studies.

1 Introduction

We are interested in the sparse learning problem for least squares regression. Consider a set of $p$ basis vectors $\{x_1, \ldots, x_p\}$ where $x_j \in \mathbb{R}^n$ for each $j$. Here, $n$ is the sample size. Denote by $X$ the $n \times p$ data matrix, with column $j$ of $X$ being $x_j$. We are given an observation $y = [y_1, \ldots, y_n] \in \mathbb{R}^n$ that is generated from a sparse linear combination of the basis vectors plus a stochastic noise vector $\epsilon \in \mathbb{R}^n$:

$$y = X\bar\beta + \epsilon = \sum_{j=1}^{p} \bar\beta_j x_j + \epsilon,$$

where we assume that the target coefficient vector $\bar\beta$ is sparse. Throughout the paper, we consider fixed design only. That is, we assume $X$ is fixed, and randomization is with respect to the noise $\epsilon$. Note that we do not assume that the noise $\epsilon$ is zero-mean.



Define the support of a sparse vector $\beta \in \mathbb{R}^p$ as

$$\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\},$$

and $\|\beta\|_0 = |\mathrm{supp}(\beta)|$. A natural method for sparse learning is $L_0$ regularization:

$$\hat\beta_{L_0} = \arg\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|_2^2 \quad \text{subject to } \|\beta\|_0 \le k,$$

where $k$ is the sparsity. Since this optimization problem is generally NP-hard, in practice, one often considers the following $L_1$ regularization problem, which is the closest convex relaxation of $L_0$:

$$\hat\beta_{L_1} = \arg\min_{\beta \in \mathbb{R}^p} \left[ \frac{1}{n}\|X\beta - y\|_2^2 + \lambda\|\beta\|_1 \right],$$

where $\lambda$ is an appropriately chosen regularization parameter. This method is often referred to as Lasso in the statistical literature.
In practical applications, one often knows a group structure on the coefficient vector $\bar\beta$ so that variables in the same group tend to be zeros or nonzeros simultaneously. The purpose of this paper is to show that if such a structure exists, then better results can be obtained.



2 Strong Group Sparsity 

For simplicity, we shall only consider non-overlapping groups in this paper, although our analysis can be adapted to handle moderately overlapping groups.

Assume that $\{1, \ldots, p\} = \cup_{j=1}^{m} G_j$ is partitioned into $m$ disjoint groups $G_1, \ldots, G_m$: $G_i \cap G_j = \emptyset$ when $i \neq j$. Moreover, throughout the paper, we let $k_j = |G_j|$ and $k_0 = \max_{j \in \{1,\ldots,m\}} k_j$. Given $S \subset \{1, \ldots, m\}$ that denotes a set of groups, we define $G_S = \cup_{j \in S} G_j$.

Given a subset of variables $F \subset \{1, \ldots, p\}$ and a coefficient vector $\beta \in \mathbb{R}^p$, let $\beta_F$ be the vector in $\mathbb{R}^{|F|}$ which is identical to $\beta$ in $F$. Similarly, $X_F$ is the $n \times |F|$ matrix with columns identical to $X$ in $F$.

The following method, often referred to as group Lasso, has been proposed to take advantage of the group structure:

$$\hat\beta = \arg\min_{\beta} \left[ \frac{1}{n}\|X\beta - y\|_2^2 + \sum_{j=1}^{m} \lambda_j \|\beta_{G_j}\|_2 \right]. \qquad (1)$$

The purpose of this paper is to develop a theory that characterizes the performance of (1). We are interested in conditions under which group Lasso yields a better estimate of $\bar\beta$ than the standard Lasso.
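To make the group Lasso objective concrete, the following is a minimal sketch of one common way to minimize it, namely proximal gradient descent with block soft-thresholding. The paper does not specify a solver, so the algorithm, the function name `group_lasso_ista`, the conservative step size, and the toy data are all illustrative assumptions. The objective assumed is $\frac{1}{n}\|X\beta - y\|_2^2 + \sum_j \lambda_j\|\beta_{G_j}\|_2$ with non-overlapping groups.

```python
import math

def group_lasso_ista(X, y, groups, lams, iters=500):
    """Proximal gradient sketch for (1/n)||X b - y||^2 + sum_j lams[j]*||b_{G_j}||_2.

    X is a list of n rows, groups is a list of disjoint index lists, and
    lams holds one regularization weight per group (all hypothetical names).
    """
    n, p = len(X), len(X[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of X^T X,
    # so 2*frob/n bounds the Lipschitz constant of the gradient (conservative).
    frob = sum(v * v for row in X for v in row)
    step = n / (2.0 * frob)
    beta = [0.0] * p
    for _ in range(iters):
        resid = [sum(x * b for x, b in zip(row, beta)) - yi
                 for row, yi in zip(X, y)]
        grad = [2.0 / n * sum(X[i][j] * resid[i] for i in range(n))
                for j in range(p)]
        beta = [b - step * g for b, g in zip(beta, grad)]
        # Block soft-thresholding: shrink each group's 2-norm by step*lam.
        for G, lam in zip(groups, lams):
            norm = math.sqrt(sum(beta[j] ** 2 for j in G))
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            for j in G:
                beta[j] *= scale
    return beta

# Toy check with X = sqrt(n) * I, where the minimizer has a closed form:
# each group of z = X^T y / n is shrunk in norm by lam / 2.
X = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [2.0, -2.0, 0.0, 0.0]            # y = X @ [1, -1, 0, 0], noise-free
groups = [[0, 1], [2, 3]]
beta_hat = group_lasso_ista(X, y, groups, lams=[0.5, 0.5])
expected = 1.0 - 0.25 / math.sqrt(2.0)
```

In this toy case the inactive group is set exactly to zero, while the active group is shrunk as a block toward zero, which is the bias discussed in the next section.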

Instead of the standard sparsity assumption, where the complexity is measured by the number of nonzero coefficients $k$, we introduce the strong group sparsity concept below. The idea is to measure the complexity of a sparse signal using group sparsity in addition to coefficient sparsity.

Definition 2.1 A coefficient vector $\bar\beta \in \mathbb{R}^p$ is $(g, k)$ strongly group-sparse if there exists a set $S$ of groups such that

$$\mathrm{supp}(\bar\beta) \subset G_S, \quad |G_S| \le k, \quad |S| \le g.$$

The new concept is referred to as strong group-sparsity because $k$ is used to measure the sparsity of $\bar\beta$ instead of $\|\bar\beta\|_0$. If this notion is beneficial, then $k/\|\bar\beta\|_0$ should be small, which means that the signal has to be efficiently covered by the groups. In fact, the group Lasso method does not work well when $k/\|\bar\beta\|_0$ is large. In that case, the signal is only weakly group sparse, and one needs to use $\|\bar\beta\|_0$ to precisely measure the real sparsity of the signal. Unfortunately, such information is not included in the group Lasso formulation, and there is no simple fix of this problem using variations of group Lasso. This is because our theory requires that the group Lasso regularization term is strong enough to dominate the noise, and the strong regularization causes a bias of the order $O(k)$ which cannot be removed. This is one fundamental drawback which is inherent to the group Lasso formulation.
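As a small illustration of Definition 2.1, the sketch below computes the smallest $(g, k)$ for which a given vector is strongly group-sparse under a non-overlapping partition, together with $\|\beta\|_0$. The function name and the example data are hypothetical and not taken from the paper.

```python
def strong_group_sparsity(beta, groups):
    """Return the smallest (g, k) such that beta is (g, k) strongly group-sparse.

    For disjoint groups, the minimal covering set S consists exactly of the
    groups that intersect supp(beta); g = |S| and k = |G_S|.
    """
    support = {j for j, b in enumerate(beta) if b != 0}
    active = [G for G in groups if support.intersection(G)]
    g = len(active)
    k = sum(len(G) for G in active)
    return g, k

# Hypothetical example: three groups of size 4, three nonzero coefficients.
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
beta = [1.0, -1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.0]
g, k = strong_group_sparsity(beta, groups)
l0 = sum(1 for b in beta if b != 0)
ratio = k / l0   # large values indicate only weak group sparsity
```

Here two groups are needed to cover three nonzeros, so $k = 8$ while $\|\beta\|_0 = 3$; the ratio $k/\|\beta\|_0$ is the quantity that the discussion above requires to be small.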

3 Related Work 

The idea of using group structure to achieve better sparse recovery performance has received much attention. For example, group sparsity has been considered for simultaneous sparse approximation [T^ and multi-task compressive sensing [3] from the Bayesian hierarchical modeling point of view. Under the Bayesian hierarchical model framework, data from all sources contribute to the estimation of hyper-parameters in the sparse prior model. The shared prior can then be inferred from multiple sources. Although the idea can be justified using standard Bayesian intuition, there are no theoretical results showing how much better (and under what kind of conditions) the resulting algorithms perform.

In [H], the authors attempted to derive a bound on the number of samples needed to recover block sparse signals, where the coefficients in each block are either all zero or all nonzero. In our terminology, this corresponds to the case of group sparsity with equal size groups. The algorithm considered there is a special case of (1) with $\lambda_j \to 0^+$. However, their result is very loose, and does not demonstrate the advantage of group Lasso over standard Lasso.

In the statistical literature, the group Lasso (1) has been studied by a number of authors [131 m [71 El IB]. There were no theoretical results in [Bj. Although some theoretical results were developed in [HIT], neither showed that group Lasso is superior to the standard Lasso.

The authors of [5] showed that group Lasso can be superior to standard Lasso when each group is an infinite dimensional kernel, by using an argument completely different from ours (they relied on the fact that meaningful analysis can be obtained for kernel methods in infinite dimension). Their idea cannot be adapted to show the advantage of group Lasso in finite dimensional scenarios of interest, such as the standard compressive sensing setting. Therefore our analysis, which focuses on the latter, is complementary to their work.

Another related work is [5], where the authors considered a special case of group Lasso in the 
multi-task learning scenario, and showed that the number of samples required for recovering the 
exact support set may be smaller for group Lasso under appropriate conditions. However, there 
are major differences between our analysis and their analysis. For example, the group formulation 
we consider here is more general and includes the multi-task scenario as a special case. Moreover, 
we study signal recovery performance in 2-norm instead of the exact recovery of support set in 
their analysis. The sparse eigenvalue condition employed in this work is often considerably weaker 
than the irrepresentable type condition in their analysis (which is required for exact support set 
recovery). Our analysis also shows that for strongly group-sparse signals, even when the number of 
samples is large, the group Lasso can still have advantages in that it is more robust to noise than 
standard Lasso. 

In the above context, the main contribution of this work is the introduction of the strong group sparsity concept, under which a satisfactory theory of group Lasso is developed. Our result shows that strongly group sparse signals can be estimated more reliably using group Lasso, in that it requires fewer samples in the compressive sensing setting, and is more robust to noise in the statistical estimation setting.

Finally, we shall mention that, independently of the authors, results similar to those presented in this paper have also been obtained in [S] with a similar technical analysis. However, while our paper studies the general group Lasso formulation, only the special case of multi-task learning is considered in that work.

4 Assumptions 

The following assumption on the noise is important in our analysis. It captures an important advantage of group Lasso over standard Lasso under the strong group sparsity assumption.

Assumption 4.1 (Group noise condition) There exist non-negative constants $a, b$ such that for any fixed group $j \in \{1, \ldots, m\}$ and $\eta \in (0, 1)$: with probability larger than $1 - \eta$, the noise projection to the $j$-th group is bounded by:

$$\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top (\epsilon - E\epsilon)\|_2 \le a\sqrt{k_j} + b\sqrt{-\ln \eta}.$$

The importance of the assumption is that the concentration term $\sqrt{-\ln \eta}$ does not depend on $k$. This reveals a significant benefit of group Lasso over standard Lasso: the concentration term does not increase when the group size increases. This implies that if we can correctly guess the group sparsity structure, the group Lasso estimator is more stable with respect to stochastic noise than the standard Lasso.

We shall point out that this assumption holds for independent sub-Gaussian noise vectors, where $E e^{t(\epsilon_i - E\epsilon_i)} \le e^{t^2\sigma^2/2}$ for all $t$ and $i = 1, \ldots, n$. It can be shown that one may choose a = 2.8 and b = 2.4 when $\eta \in (0, 0.5)$. Since a complete treatment of sub-Gaussian noise is not important for the purpose of this paper, we only prove this assumption under independent Gaussian noise, which can be directly calculated.

Proposition 4.1 Assume that the components of the noise vector $\epsilon$ are independent Gaussians: $\epsilon_i - E\epsilon_i \sim N(0, \sigma_i^2)$, where each $\sigma_i \le \sigma$ ($i = 1, \ldots, n$). Then Assumption 4.1 holds with $a = \sigma$ and $b = \sqrt{2}\sigma$.
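The bound of Proposition 4.1 can be checked empirically. The following Monte Carlo sketch assumes equal noise levels $\sigma_i = \sigma$ and reads the proposition as giving $a = \sigma$, $b = \sqrt{2}\sigma$. It uses the fact that $\|(X_G^\top X_G)^{-0.5}X_G^\top v\|_2$ equals the norm of the orthogonal projection of $v$ onto the column space of $X_G$, so an orthonormal basis obtained by Gram-Schmidt suffices. All names, sizes, and the random seed are illustrative choices, not part of the paper.

```python
import math
import random

def orthonormal_basis(cols):
    """Modified Gram-Schmidt on column vectors (assumed linearly independent)."""
    basis = []
    for c in cols:
        v = list(c)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis

def projected_noise_norm(basis, noise):
    """Norm of the projection of `noise` onto span(basis)."""
    return math.sqrt(sum(sum(a * b for a, b in zip(q, noise)) ** 2
                         for q in basis))

rng = random.Random(0)
n, k, sigma, eta = 50, 4, 1.0, 0.1
cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]  # columns of X_G
basis = orthonormal_basis(cols)

# Bound a*sqrt(k) + b*sqrt(-ln eta) with a = sigma, b = sqrt(2)*sigma.
bound = sigma * math.sqrt(k) + math.sqrt(2.0) * sigma * math.sqrt(-math.log(eta))

trials = 2000
exceed = sum(
    projected_noise_norm(basis, [rng.gauss(0.0, sigma) for _ in range(n)]) > bound
    for _ in range(trials)
)
frac = exceed / trials   # empirical probability of violating the bound
```

The empirical violation frequency `frac` should stay below $\eta$; in practice it is much smaller, which reflects that the constants in the proposition are not tight.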



The next assumption handles the case that the true target is not exactly sparse. That is, we only assume that $X\bar\beta \approx Ey$.

Assumption 4.2 (Group approximation error condition) There exist $\delta_a, \delta_b \ge 0$ such that for all groups $j \in \{1, \ldots, m\}$, the projection of the error mean $E\epsilon$ to the $j$-th group is bounded by:

$$\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top E\epsilon\|_2 / \sqrt{n} \le \sqrt{k_j}\,\delta_a + \delta_b.$$

As mentioned earlier, we do not assume that the noise is zero-mean. Hence $E\epsilon$ may not equal zero. In other words, this condition considers the situation that the true target is not exactly sparse. It resembles algebraic noise in [H] but takes the group structure into account. Similar to [II], we have the following result.

Proposition 4.2 Consider a $(g, k)$ strongly group sparse coefficient vector $\bar\beta$ such that

$$\frac{1}{n}\|X\bar\beta - Ey\|_2^2 \le \Delta^2$$

and $a_0, b_0 \ge 0$. Then there exists a $(g', k')$ strongly group sparse $\bar\beta'$ such that $k'a_0^2 + g'b_0^2 \le 2(ka_0^2 + gb_0^2)$, $\|X\bar\beta' - Ey\|_2 \le \|X\bar\beta - Ey\|_2$, $\mathrm{supp}(\bar\beta) \subset \mathrm{supp}(\bar\beta')$, and for all groups $j$:

$$\frac{1}{\sqrt{n}}\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top (X\bar\beta' - Ey)\|_2 \le \sqrt{k_j a_0^2 + b_0^2}\;\frac{\Delta}{\sqrt{ka_0^2 + b_0^2}}.$$

The proposition shows that if the approximation error of $\bar\beta$ is $\Delta = \|X\bar\beta - Ey\|_2/\sqrt{n}$, then we may find an alternative target $\bar\beta'$ with similar sparsity for which we can take $\delta_a = a_0\Delta/\sqrt{ka_0^2 + b_0^2}$ and $\delta_b = b_0\Delta/\sqrt{ka_0^2 + b_0^2}$ in Assumption 4.2. This means that in Theorem 5.1 below, by choosing $a_0 = a$ and $b_0 = b\sqrt{\ln(m/\eta)}$, the contribution of the approximation error to the reconstruction error $\|\hat\beta - \bar\beta\|_2$ is $O(\Delta)$. Note that this assumption does not show the benefit of group Lasso over standard Lasso. Therefore in order to compare our results to that of the standard Lasso, one may consider the simple situation where $\delta_a = \delta_b = 0$. That is, the target is exactly sparse. The only reason to include Assumption 4.2 is to illustrate that our analysis can handle approximate sparsity.

The last assumption is a sparse eigenvalue condition, used in the modern analysis of Lasso (e.g., [3 HI]). It is also closely related to (and slightly weaker than) the RIP (restricted isometry property) assumption in the compressive sensing literature. This assumption takes advantage of group structure, and can be considered as (a weaker version of) group RIP. We introduce a definition before stating the assumption.

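Definition 4.1 For all $F \subset \{1, \ldots, p\}$, define

$$\rho_-(F) = \inf\left\{ \frac{1}{n}\|X\beta\|_2^2 / \|\beta\|_2^2 : \mathrm{supp}(\beta) \subset F \right\},$$

$$\rho_+(F) = \sup\left\{ \frac{1}{n}\|X\beta\|_2^2 / \|\beta\|_2^2 : \mathrm{supp}(\beta) \subset F \right\}.$$

Moreover, for all $1 \le s \le p$, define

$$\rho_-(s) = \inf\{\rho_-(G_S) : S \subset \{1, \ldots, m\},\ |G_S| \le s\},$$

$$\rho_+(s) = \sup\{\rho_+(G_S) : S \subset \{1, \ldots, m\},\ |G_S| \le s\}.$$

Assumption 4.3 (Group sparse eigenvalue condition) There exist $s, c > 0$ such that

$$\frac{\rho_+(s) - \rho_-(2s)}{\rho_-(s)} \le c.$$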



Assumption 4.3 illustrates another advantage of group Lasso over standard Lasso. Since we only consider eigenvalues for sub-matrices consistent with the group structure $\{G_j\}$, the ratio $\rho_+(s)/\rho_-(s)$ can be significantly smaller than the corresponding ratio for Lasso (which considers all subsets of $\{1, \ldots, p\}$ up to size $s$). For example, assume that all group sizes are identical, $k_1 = \cdots = k_m = k_0$, and $s$ is a multiple of $k_0$. For random projections used in compressive sensing applications, only $n = O(s + (s/k_0)\ln m)$ projections are needed for Assumption 4.3 to hold. In comparison, for standard Lasso, we need $n = O(s \ln p)$ projections. The difference can be significant when $p$ and $k_0$ are large. More precisely, we have the following random projection sample complexity bound for the group sparse eigenvalue condition. Although we assume a Gaussian random matrix in order to state explicit constants, it is clear that similar results hold for other sub-Gaussian random matrices.



Proposition 4.3 (Group-RIP) Suppose that elements in $X$ are iid standard Gaussian random variables $N(0, 1)$. For any $t > 0$ and $\delta \in (0, 1)$, let

$$n \ge \frac{8}{\delta^2}\left[\ln 3 + t + k\ln(1 + 8/\delta) + g\ln(em/g)\right].$$

Then with probability at least $1 - e^{-t}$, the random matrix $X \in \mathbb{R}^{n \times p}$ satisfies the following group-RIP inequality for all $(g, k)$ strongly group-sparse vectors $\beta \in \mathbb{R}^p$:

$$(1 - \delta)\|\beta\|_2 \le \frac{1}{\sqrt{n}}\|X\beta\|_2 \le (1 + \delta)\|\beta\|_2. \qquad (2)$$
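To get a feel for the sample size in Proposition 4.3, the sketch below evaluates the bound $n \ge (8/\delta^2)[\ln 3 + t + k\ln(1 + 8/\delta) + g\ln(em/g)]$. The comparison with the ungrouped case treats every variable as its own group ($g = k$, $m = p$), which is an illustrative reading rather than a statement from the paper; the function names and the parameter values are likewise assumptions.

```python
import math

def group_rip_sample_size(k, g, m, delta, t):
    """Smallest integer n satisfying the sample size bound of the group-RIP proposition."""
    inner = (math.log(3) + t + k * math.log(1 + 8.0 / delta)
             + g * math.log(math.e * m / g))
    return math.ceil(8.0 / delta ** 2 * inner)

def singleton_rip_sample_size(k, p, delta, t):
    """Same bound when each variable forms its own group (g = k, m = p)."""
    return group_rip_sample_size(k, k, p, delta, t)

# Hypothetical setting: p = 512 variables in m = 128 groups of size 4,
# with a signal covered by g = 16 groups (k = 64).
n_group = group_rip_sample_size(k=64, g=16, m=128, delta=0.5, t=1.0)
n_single = singleton_rip_sample_size(k=64, p=512, delta=0.5, t=1.0)
```

The explicit constants are conservative, so the absolute numbers are large; the point of the comparison is only that the $g\ln(em/g)$ term is smaller than its ungrouped counterpart $k\ln(ep/k)$.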






5 Main Results 



Our main result is the following signal recovery (2-norm parameter estimation error) bound for group Lasso.

Theorem 5.1 Suppose that Assumption 4.1, Assumption 4.2, and Assumption 4.3 are valid. Take $\lambda_j = (A\sqrt{k_j} + B)/\sqrt{n}$, where both $A$ and $B$ can depend on data $y$. Given $\eta \in (0, 1)$, with probability larger than $1 - \eta$, if the following conditions hold:

• $A \ge 4\max_j \rho_+(G_j)^{1/2}(a + \delta_a\sqrt{n})$,

• $B \ge 4\max_j \rho_+(G_j)^{1/2}(b\sqrt{\ln(m/\eta)} + \delta_b\sqrt{n})$,

• $\bar\beta$ is a $(g, k)$ strongly group-sparse coefficient vector,

• $s \ge k + k_0$,

• with $\ell = s - (k + k_0) + 1$ and $g_\ell = \min\{|S| : |G_S| \ge \ell,\ S \subset \{1, \ldots, m\}\}$, we have

$$c^2 \le \frac{\ell A^2 + g_\ell B^2}{72(kA^2 + gB^2)},$$

then the solution of (1) satisfies:

$$\|\hat\beta - \bar\beta\|_2 \le \frac{\sqrt{4.5}}{\rho_-(s)\sqrt{n}}\,(1 + 0.25c^{-1})\sqrt{A^2 k + gB^2}.$$



The first four conditions of the theorem are not critical, as they are just definitions and choices for $\lambda_j$. The fifth condition is critical: it means that the group sparse eigenvalue condition has to be satisfied with some $c$ that is not too large. In order to satisfy the condition, $\ell$ should be chosen relatively large, as the right hand side is linear in $\ell$. However, this implies that $s$ also grows linearly. It is possible to find $s$ so that the condition is satisfied when $c$ in Assumption 4.3 grows sub-linearly in $s$. Consider the situation that $\delta_a = \delta_b = 0$. If the conditions of Theorem 5.1 are satisfied, then

$$\|\hat\beta - \bar\beta\|_2^2 = O((k + g\ln(m/\eta))/n).$$

In comparison, the Lasso estimator can only achieve the bound

$$\|\hat\beta_{L_1} - \bar\beta\|_2^2 = O((\|\bar\beta\|_0 \ln(p/\eta))/n).$$

If $k/\|\bar\beta\|_0 \ll \ln(p/\eta)$ (which means that the group structure is useful) and $g \ll \|\bar\beta\|_0$, then the group Lasso is superior. This is consistent with intuition. However, if $k \gg \|\bar\beta\|_0 \ln(p/\eta)$, then group Lasso is inferior. This happens when the signal is not strongly group sparse.
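The two rates above can be compared numerically. The sketch below evaluates $(k + g\ln(m/\eta))/n$ and $\|\bar\beta\|_0\ln(p/\eta)/n$ with the constants hidden in the $O(\cdot)$ notation ignored, so only orders of magnitude are meaningful. The function names and the two parameter settings are illustrative assumptions.

```python
import math

def group_lasso_rate(k, g, m, n, eta):
    """Order of the squared-error bound for group Lasso, constants dropped."""
    return (k + g * math.log(m / eta)) / n

def lasso_rate(l0, p, n, eta):
    """Order of the squared-error bound for standard Lasso, constants dropped."""
    return l0 * math.log(p / eta) / n

n, p, eta = 192, 512, 0.1

# Strongly group-sparse case: the groups cover the support exactly (k = ||beta||_0).
strong_group = group_lasso_rate(k=64, g=16, m=128, n=n, eta=eta)
strong_lasso = lasso_rate(l0=64, p=p, n=n, eta=eta)

# Weakly group-sparse case: 4 nonzeros spread over 4 groups of size 64.
weak_group = group_lasso_rate(k=256, g=4, m=8, n=n, eta=eta)
weak_lasso = lasso_rate(l0=4, p=p, n=n, eta=eta)
```

In the first setting the group Lasso rate is the smaller one, while in the second, where $k \gg \|\bar\beta\|_0\ln(p/\eta)$, the ordering reverses, in line with the discussion above.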

Theorem 5.1 also suggests that if the group sizes are not even, then group Lasso may not work well when the signal is contained in small sized groups. This is because in such a case $g_\ell$ can be significantly smaller than $g$ even with relatively large $\ell$, which means we have to choose a large $s$ and small $c$, implying a poor bound. This prediction is confirmed in Section 6 using simulated data. Intuitively, group Lasso favors large sized groups because the 2-norm regularization for large group size is weaker. Adjusting the regularization parameters $\lambda_j$ not only fails to work in theory, but is also impractical, since it is unrealistic to tune many parameters. This unstable behavior with respect to uneven group size may be regarded as another drawback of the group Lasso formulation.



In the following, we present two simplifications of Theorem 5.1 that are easier to interpret. The first is the compressive sensing case, which does not consider stochastic noise.

Corollary 5.1 (Compressive sensing) Suppose that Assumption 4.1 and Assumption 4.2 are valid with $a = b = \delta_b = 0$. Take $\lambda_j = 4\sqrt{k_j}\,\max_i \rho_+(G_i)^{1/2}\delta_a$. Let $\bar\beta$ be a $(g, k)$ strongly group-sparse signal, $\ell = k$, and $s = 2k + k_0 - 1$. If $(\rho_+(s) - \rho_-(2s))/\rho_-(s) \le 1/\sqrt{72}$, then the solution of (1) satisfies:

$$\|\hat\beta - \bar\beta\|_2 \le \frac{4\sqrt{4.5}\,(1 + 0.25\sqrt{72})}{\rho_-(s)}\,\max_i \rho_+(G_i)^{1/2}\,\delta_a\sqrt{k}.$$

If $\delta_a = 0$, then we can achieve exact recovery. Moreover, Proposition 4.2 implies that we may choose a target with similar sparsity such that $\delta_a\sqrt{k} = O(\|X\bar\beta - Ey\|_2/\sqrt{n})$. This implies a bound

$$\|\hat\beta - \bar\beta\|_2 = O(\|X\bar\beta - Ey\|_2/\sqrt{n}).$$

If we have even sized groups, the number of samples $n$ required for Corollary 5.1 to hold (that is, $(\rho_+(s) - \rho_-(2s))/\rho_-(s) \le 1/\sqrt{72}$) is $O(k + g\ln(m/g))$, where $g = k/k_0$. In comparison, although a similar result holds for Lasso, it requires a sample size of order $\|\bar\beta\|_0 \ln(p/\|\bar\beta\|_0)$. Again, group Lasso has a significant advantage if $k/\|\bar\beta\|_0 \ll \ln(p/\|\bar\beta\|_0)$, $g \ll \|\bar\beta\|_0$, and $p$ is large.

The following corollary is for even sized groups, and the result is simpler to interpret. For standard Lasso, $B = O(\sqrt{\ln p})$, and for group Lasso, $B = O(\sqrt{\ln m})$. The benefit of group Lasso is the division of $B^2$ by $k_0$ in the bound, which is a significant improvement when the dimensionality $p$ is large. The disadvantage of group Lasso is that the signal sparsity $\|\bar\beta\|_0$ is replaced by the group sparsity $k$. This is not an artifact of our analysis, but rather a fundamental drawback inherent to the group Lasso formulation. The effect is observable, as shown in our simulation studies.



Corollary 5.2 (Even group size) Suppose that Assumption 4.1 and Assumption 4.2 are valid. Assume also that all groups are of equal sizes: $k_0 = k_j$ for $j = 1, \ldots, m$. Given $\eta \in (0, 1)$, let

$$\lambda_j = (A\sqrt{k_0} + B)/\sqrt{n},$$

where $A \ge 4\max_j \rho_+(G_j)^{1/2}(a + \delta_a\sqrt{n})$ and $B \ge 4\max_j \rho_+(G_j)^{1/2}(b\sqrt{\ln(m/\eta)} + \delta_b\sqrt{n})$. Let $\bar\beta$ be a $(k/k_0, k)$ strongly group-sparse signal. With probability larger than $1 - \eta$, if

$$6\sqrt{2}\,(\rho_+(k + \ell) - \rho_-(2k + 2\ell))/\rho_-(k + \ell) \le \sqrt{\ell/k}$$

for some $\ell > 0$ that is a multiple of $k_0$, then the solution of (1) satisfies:

$$\|\hat\beta - \bar\beta\|_2 \le \rho_-(k + \ell)^{-1}\left(\sqrt{4.5} + 4.5\sqrt{k/\ell}\right)\sqrt{A^2 + B^2/k_0}\,\sqrt{k/n}.$$



6 Simulation Studies 

We want to verify our theory by comparing group Lasso to Lasso on simulation data. For quantitative evaluation, the recovery error is defined as the relative difference in 2-norm between the estimated sparse coefficient vector $\hat\beta_{est}$ and the ground-truth sparse coefficient vector $\bar\beta$: $\|\hat\beta_{est} - \bar\beta\|_2/\|\bar\beta\|_2$. The regularization parameter $\lambda$ in Lasso is chosen with five-fold cross validation. In group Lasso, we simply set the regularization parameters to $\lambda_j = (\lambda\sqrt{k_j})/\sqrt{n}$ for $j = 1, 2, \ldots, m$. The regularization parameter $\lambda$ is then chosen with five-fold cross validation. Here we set $B = 0$ in the formula $\lambda_j = O(A\sqrt{k_j} + B)$. Since the relative performance of group Lasso versus standard Lasso is similar with other values of $B$, in order to avoid redundancy, we do not include results with $B \neq 0$.
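The evaluation metric and the per-group regularization weights described above are simple to state in code. The sketch below is a minimal illustration; the function names are hypothetical, and the cross validation used to pick $\lambda$ is not shown.

```python
import math

def recovery_error(beta_est, beta_true):
    """Relative 2-norm error ||beta_est - beta_true||_2 / ||beta_true||_2."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_est, beta_true)))
    den = math.sqrt(sum(b * b for b in beta_true))
    return num / den

def group_weights(lam, group_sizes, n):
    """Per-group weights lambda_j = lam * sqrt(k_j) / sqrt(n), i.e. B = 0."""
    return [lam * math.sqrt(kj) / math.sqrt(n) for kj in group_sizes]

# Hypothetical usage.
err_perfect = recovery_error([1.0, 0.0], [1.0, 0.0])
err_zero_estimate = recovery_error([0.0, 0.0], [3.0, 4.0])
weights = group_weights(lam=1.0, group_sizes=[4, 16], n=4)
```

Note that the all-zero estimate has recovery error exactly 1, which gives a useful reference point when reading the reported error values.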
6.1 Even group size



In this set of experiments, the projection matrix $X$ is generated by creating an $n \times p$ matrix with i.i.d. draws from a standard Gaussian distribution $N(0, 1)$. For simplicity, the rows of $X$ are normalized to unit magnitude. Zero-mean Gaussian noise with standard deviation $\sigma = 0.01$ is added to the measurements. Our task is to compare the recovery performance of Lasso and Group Lasso for these $(g, k)$ strongly group sparse signals.
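The data generation just described can be sketched as follows, using only the standard library. This is an illustrative reconstruction of the setup, not the authors' code; the function names, the seed, and the placeholder coefficient vector are assumptions.

```python
import math
import random

def make_design(n, p, rng):
    """n x p matrix of iid N(0, 1) draws with each row scaled to unit 2-norm."""
    X = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in row))
        X.append([v / norm for v in row])
    return X

def observe(X, beta, sigma, rng):
    """Measurements y = X beta + zero-mean Gaussian noise of std sigma."""
    return [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma)
            for row in X]

rng = random.Random(0)
n, p = 192, 512
X = make_design(n, p, rng)
beta = [0.0] * p
beta[0:4] = [1.0, -1.0, 1.0, 1.0]   # placeholder signal for illustration
y = observe(X, beta, sigma=0.01, rng=rng)
```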



6.1.1 With correct group structure 

In this experiment, we randomly generate $(g, k)$ strongly group sparse coefficients with values ±1, where $p = 512$, $k = 64$ and $g = 16$. There are 128 groups with even group size of $k_0 = 4$. Here the group structure coincides with the signal sparsity: $k = \|\bar\beta\|_0$.
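A signal of this kind can be generated as in the sketch below, which picks $g$ of the $m = p/k_0$ contiguous groups at random and fills them with ±1 values. The contiguous layout of the groups, the function name, and the seed are assumptions made for illustration.

```python
import random

def group_sparse_signal(p, k0, g, rng):
    """Random signal with g active groups of size k0 and entries in {-1, +1}.

    Groups are assumed to be the contiguous blocks [j*k0, (j+1)*k0).
    """
    m = p // k0
    active = sorted(rng.sample(range(m), g))
    beta = [0.0] * p
    for j in active:
        for i in range(j * k0, (j + 1) * k0):
            beta[i] = rng.choice([-1.0, 1.0])
    return beta, active

rng = random.Random(0)
beta, active = group_sparse_signal(p=512, k0=4, g=16, rng=rng)
l0 = sum(1 for b in beta if b != 0)   # equals k = g * k0 for this construction
```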

Figure 1 shows an instance of the generated sparse coefficient vector and the results recovered by Lasso and group Lasso, respectively, when $n = 3k = 192$. Since the sample size $n$ is only three times the signal sparsity $k$, the standard Lasso does not achieve good recovery results, whereas the group Lasso achieves near perfect recovery of the original signal.

Figure 2(a) shows the effect of sample size $n$, where we report the averaged recovery error over 100 random runs for each sample size. Group Lasso is clearly superior in this case. These results show that the group Lasso can achieve better recovery performance for $(g, k)$ strongly group sparse signals with fewer measurements, which is consistent with our theory.



Figure 1: Recovery results when the assumed group structure is correct. (a) Original data; (b) results with Lasso (recovery error is 0.3444); (c) results with Group Lasso (recovery error is 0.0419)



To study the effect of the group number $g$ (with $k$ fixed), we set the sample size $n = 160$ and then change the group number while keeping other parameters unchanged. Figure 2(b) shows the recovery performance of the two algorithms, averaged over 100 random runs for each group number. As expected, the recovery performance for Lasso is independent of the group number within statistical error. Moreover, the recovery results for group Lasso are significantly better when the group number $g$ is much smaller than the sparsity $k = 64$. When $g = k$, the group Lasso becomes identical to Lasso, which is expected. This shows that the recovery performance of group Lasso degrades when $g/k$ increases, which confirms our theory.



6.1.2 With incorrect group structure 

In this experiment, we assume that the known group structure is not exactly the same as the sparsity of the signal (that is, $k > \|\bar\beta\|_0$). We randomly generate strongly group sparse coefficients with values ±1, where $p = 512$, $\|\bar\beta\|_0 = 64$ and $g = 16$. In the first experiment, we let $k = 4\|\bar\beta\|_0$, and use $m = 32$ groups with even group size of $k_0 = 16$.

Figure 3 shows one instance of the generated sparse signal and the results recovered by Lasso and group Lasso, respectively, when $n = 3\|\bar\beta\|_0 = 192$. In this case, the standard Lasso obtains better recovery results than the group Lasso. Figure 4(a) shows the effect of sample size $n$, where we report the averaged recovery error over 100 random runs for each sample size. The group Lasso recovery performance is clearly inferior to that of the Lasso. This shows that group Lasso fails when $k/\|\bar\beta\|_0$ is relatively large, which is consistent with our theory.

To study the effect of $k/\|\bar\beta\|_0$ on the group Lasso performance, we keep $\|\bar\beta\|_0$ fixed, and simply vary the group size as $k_0 = 1, 2, 4, 8, 16, 32, 64$ with $k/\|\bar\beta\|_0 = 1, 1, 1, 2, 4, 8, 16$. Figure 4(b) shows the performance of the two algorithms with different group sizes $k_0$ in terms of recovery error. It shows that the performance of group Lasso is better when $k/\|\bar\beta\|_0 = 1$. However, when $k/\|\bar\beta\|_0 > 1$, the performance of group Lasso deteriorates.



Figure 3: Recovery results when the assumed group structure is incorrect. (a) Original data; (b) results with Lasso (recovery error is 0.3616); (c) results with Group Lasso (recovery error is 0.6688)

Figure 4: Recovery performance: (a) recovery error vs. sample size ratio $n/k$; (b) recovery error vs. group size $k_0$

6.2 Uneven group size

In this set of experiments, we randomly generate $(g, k)$ strongly group sparse coefficients with values ±1, where $p = 512$ and $g = 4$. There are 64 uneven sized groups. The projection matrix $X$ and noises are generated as in the even group size case. Our task is to compare the recovery performance of Lasso and Group Lasso for $(g, k)$ strongly group sparse signals with $\|\bar\beta\|_0 = k$. To reduce the variance, we run each experiment 100 times and report the average performance.

In the first experiment, the group sizes of the 64 groups are randomly generated and the $g = 4$ active groups are randomly extracted from these 64 groups. Figure 5(a) shows the recovery performance of Lasso and group Lasso with increasing sample size (measurements) in terms of recovery error. Similar to the case of even group size, the group Lasso obtains better recovery results than Lasso. It shows that the group Lasso is superior when the group sizes are randomly uneven.

Figure 5: Recovery performance: (a) g active groups have randomly uneven group sizes; (b) half of g active groups are single element groups and another half of g active groups have large group size



As discussed after Theorem 5.1, because group Lasso favors large sized groups, if the signal is contained in small sized groups, then the performance of group Lasso can be relatively poor. In order to confirm this claim of Theorem 5.1, we consider the special case where 32 groups have large group sizes and each of the remaining 32 groups has only one element. First, we consider the case where half of the $g = 4$ active groups are extracted from the single element groups and the other half are extracted from the groups with large size. Figure 5(b) shows the signal recovery performance of Lasso and group Lasso. It is clear that the group Lasso performs better, but the results are not as good as those of Figure 5(a).

Moreover, Figure 6(a) shows the recovery performance of Lasso and group Lasso when all of the $g = 4$ active groups are extracted from large sized groups. We observe that the relative performance of group Lasso improves. Finally, Figure 6(b) shows the recovery performance of Lasso and group Lasso when all of the $g = 4$ active groups are extracted from single element groups. It is obvious that the group Lasso is inferior to Lasso in this case. This confirms the prediction of Theorem 5.1 that group Lasso favors large sized groups.

Figure 6: Recovery performance: (a) all g active groups have large group size; (b) all g active groups are single element groups

7 Conclusion

In this paper we introduced a concept called strong group sparsity that characterizes the signal recovery performance of group Lasso. In particular, we showed that group Lasso is superior to standard Lasso when the underlying signal is strongly group-sparse:

• Group Lasso is more robust to noise due to the stability associated with group structure.

• Group Lasso requires a smaller sample size to satisfy the sparse eigenvalue condition required in the modern sparsity analysis.

However, group Lasso can be inferior if the signal is only weakly group-sparse, or covered by groups with small sizes. Moreover, group Lasso does not perform well with overlapping groups (which is not analyzed in this paper). Better learning algorithms are needed to overcome these limitations.
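A Proof of Proposition 4.1

Without loss of generality, we may assume $\sigma_i > 0$ for all $i$ (otherwise, we can still let $\sigma_i > 0$ and then just take the limit $\sigma_i \to 0$ for some $i$).

For notational simplicity, we remove the subscript $j$ from the group index, and consider group $G$ with $k$ variables.

Let $\Sigma$ be the diagonal matrix with $\sigma_i^2$ as its diagonal elements. We can find an $n \times k$ matrix $Z = X_G(X_G^\top \Sigma X_G)^{-0.5}$ such that $Z^\top \Sigma Z = I_{k \times k}$. Let $\xi = Z^\top(\epsilon - E\epsilon) \in \mathbb{R}^k$. Since for all $v \in \mathbb{R}^n$,

$$\|(X_G^\top X_G)^{-0.5} X_G^\top v\|_2 = \|(Z^\top Z)^{-0.5} Z^\top v\|_2,$$

we have

$$\frac{\|(X_G^\top X_G)^{-0.5} X_G^\top(\epsilon - E\epsilon)\|_2^2}{\|\xi\|_2^2} \le \sup_{v \in \mathbb{R}^n} \frac{v^\top Z(Z^\top Z)^{-1} Z^\top v}{v^\top Z Z^\top v} = \sup_{u \in \mathbb{R}^k} \frac{u^\top (Z^\top Z)^{-1} u}{u^\top u} = \sup_{u \in \mathbb{R}^k} \frac{u^\top Z^\top \Sigma Z u}{u^\top Z^\top Z u} \le \sup_{v \in \mathbb{R}^n} \frac{v^\top \Sigma v}{v^\top v} \le \sigma^2.$$

Therefore, we only need to show that with probability at least $1 - \eta$ for all $\eta \in (0, 1)$:

$$\|\xi\|_2 \le a\sqrt{k} + b\sqrt{-\ln \eta} \qquad (3)$$

with $a = 1$ and $b = \sqrt{2}$.

To prove this inequality, we note that the condition $Z^\top \Sigma Z = I_{k \times k}$ means that the covariance matrix of $\xi$ is $I_{k \times k}$. Therefore the components of $\xi$ are $k$ iid Gaussians $N(0, 1)$, and the distribution of $\|\xi\|_2^2$ is $\chi^2$ with $k$ degrees of freedom. Many methods have been suggested to approximate the tail probability of the $\chi^2$ distribution. For example, a well-known approximation of $\|\xi\|_2$ is the normal $N(\sqrt{k - 0.5}, 0.5)$, which would imply $a = b = 1$ in (3). In the following, we derive a slightly weaker tail probability bound using direct integration of the tail probability for $\delta \ge \sqrt{k}$:

$$P(\|\xi\|_2 \ge \delta) = \frac{2}{\Gamma(k/2)2^{k/2}} \int_{x \ge \delta} x^{k-1} e^{-x^2/2}\,dx = \frac{2\delta^{k-1}}{\Gamma(k/2)2^{k/2}} \int_{x \ge 0} e^{-(x+\delta)^2/2 + (k-1)\ln(1 + x/\delta)}\,dx$$

$$\le \frac{2\delta^{k-1} e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}} \int_{x \ge 0} e^{-x^2/2}\,dx = \frac{\sqrt{2\pi}\,\delta^{k-1} e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}}$$

$$\le \sqrt{0.5}\, e^{-\delta^2/2 + 0.5k + (k-1)(\delta/\sqrt{k} - 1)} \le \sqrt{0.5}\, e^{-(\delta - \sqrt{k})^2/2}.$$

This implies that (3) holds with $a = 1$ and $b = \sqrt{2}$.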

Note that in the above derivation, we have used the following Stirling lower bound for the Gamma function:

$$\Gamma(0.5k) \ge \sqrt{2\pi}\,(0.5k)^{0.5k - 0.5} e^{-0.5k}.$$
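The Stirling-type lower bound used above, read here as $\Gamma(0.5k) \ge \sqrt{2\pi}(0.5k)^{0.5k-0.5}e^{-0.5k}$, can be sanity-checked numerically for a range of $k$. The sketch below is only a spot check over a finite range chosen for illustration, not a proof.

```python
import math

def stirling_lower(k):
    """Right-hand side sqrt(2*pi) * x**(x - 0.5) * exp(-x) with x = 0.5 * k."""
    x = 0.5 * k
    return math.sqrt(2.0 * math.pi) * x ** (x - 0.5) * math.exp(-x)

# Compare against the exact Gamma function for k = 1, ..., 200.
holds = all(math.gamma(0.5 * k) >= stirling_lower(k) for k in range(1, 201))
```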



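B Proof of Proposition 4.2

We consider the following group-greedy procedure starting with $\beta^{(0)} = \bar\beta$, and form a $(g^{(\ell)}, k^{(\ell)})$ strongly group sparse $\beta^{(\ell)}$ as follows for $\ell = 1, 2, \ldots$

• let $r^{(\ell-1)} = X\beta^{(\ell-1)} - Ey$,

• let $j^{(\ell)} = \arg\max_j \left[\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top r^{(\ell-1)}\|_2 / \sqrt{k_j a_0^2 + b_0^2}\right]$,

• let $\beta^{(\ell)} = \beta^{(\ell-1)}$ and then reset its coefficients in group $G_j$ as $\beta^{(\ell)}_{G_j} = \beta^{(\ell-1)}_{G_j} - (X_{G_j}^\top X_{G_j})^{-1} X_{G_j}^\top r^{(\ell-1)}$, where $j = j^{(\ell)}$.

It is not difficult to check that

$$\|r^{(\ell-1)}\|_2^2 - \|r^{(\ell)}\|_2^2 = \|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top r^{(\ell-1)}\|_2^2,$$

$k^{(\ell)} - k^{(\ell-1)} \le k_j$, and $g^{(\ell)} - g^{(\ell-1)} \le 1$, with $j = j^{(\ell)}$. Therefore if for all $0 \le \ell \le t$, we have

$$\max_j \|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top r^{(\ell)}\|_2 / \sqrt{k_j a_0^2 + b_0^2} > \sqrt{n}\,\Delta / \sqrt{ka_0^2 + b_0^2},$$

then by summing over $\ell = 1, \ldots, t, t + 1$, we obtain

$$n\Delta^2 \ge \|r^{(0)}\|_2^2 \ge \sum_{\ell=1}^{t+1}\left[\|r^{(\ell-1)}\|_2^2 - \|r^{(\ell)}\|_2^2\right] \ge \sum_{\ell=1}^{t+1} (k_{j^{(\ell)}} a_0^2 + b_0^2)\,\frac{n\Delta^2}{ka_0^2 + b_0^2} \ge n\left[(k^{(t+1)} - k)a_0^2 + (g^{(t+1)} - g)b_0^2\right]\Delta^2/(ka_0^2 + b_0^2).$$

This implies that

$$(k^{(t+1)} - k)a_0^2 + (g^{(t+1)} - g)b_0^2 \le ka_0^2 + b_0^2 \le ka_0^2 + gb_0^2.$$

Therefore if we let $t$ be the first time $k^{(t+1)}a_0^2 + g^{(t+1)}b_0^2 > 2(ka_0^2 + gb_0^2)$, then there exists $\ell \le t$ such that $\bar\beta' = \beta^{(\ell)}$ satisfies the requirement.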



C Proof of Proposition 4.3 



The following lemma is taken from [9]. Since the proof is simple, it is included for completeness.

Lemma C.1 Consider the unit sphere $S^{k-1} = \{x : \|x\|_2 = 1\}$ in $\mathbb{R}^k$ ($k \ge 1$). Given any $\epsilon > 0$, there exists an $\epsilon$-cover $Q \subset S^{k-1}$ such that $\min_{q \in Q} \|x - q\|_2 \le \epsilon$ for all $\|x\|_2 = 1$, with $|Q| \le (1 + 2/\epsilon)^k$.

Proof Let $B^k = \{x : \|x\|_2 \le 1\}$ be the unit ball in $\mathbb{R}^k$. Let $Q = \{q_i\}_{i=1,\ldots,|Q|} \subset S^{k-1}$ be a maximal subset such that $\|q_i - q_j\|_2 > \epsilon$ for all $i \neq j$. By maximality, $Q$ is an $\epsilon$-cover of $S^{k-1}$. Since the balls $q_i + (\epsilon/2)B^k$ are disjoint and belong to $(1 + \epsilon/2)B^k$, we have

$$\sum_{i \le |Q|} \mathrm{vol}(q_i + (\epsilon/2)B^k) \le \mathrm{vol}((1 + \epsilon/2)B^k).$$

Therefore,

$$|Q|(\epsilon/2)^k\,\mathrm{vol}(B^k) \le (1 + \epsilon/2)^k\,\mathrm{vol}(B^k),$$

which implies that $|Q| \le (1 + 2/\epsilon)^k$. ■
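The packing argument in this proof can be illustrated with a greedy construction: scan a set of points on the sphere and keep a point only if it is farther than $\epsilon$ from every point kept so far. The kept set is $\epsilon$-separated, so the volume bound $(1 + 2/\epsilon)^k$ applies to it, and by construction it covers every scanned point. The sketch below works on a finite random sample rather than the whole sphere; the sample size, seed, and function names are illustrative assumptions.

```python
import math
import random

def random_unit_vector(k, rng):
    """Uniform random point on the unit sphere in R^k (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def greedy_separated_set(points, eps):
    """Keep a point only if it is more than eps away from all kept points."""
    kept = []
    for x in points:
        if all(math.dist(x, q) > eps for q in kept):
            kept.append(x)
    return kept

rng = random.Random(0)
k, eps = 3, 0.5
points = [random_unit_vector(k, rng) for _ in range(2000)]
Q = greedy_separated_set(points, eps)

# Every sampled point is within eps of some element of Q.
covered = all(min(math.dist(x, q) for q in Q) <= eps for x in points)
bound = (1 + 2 / eps) ** k   # cardinality bound from the lemma
```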



The following concentration result for the $\chi^2$ distribution is similar to Proposition 4.1. This is where the Gaussian assumption is used in the proof. A similar result holds for sub-Gaussian random variables.

Lemma C.2 Let $\xi \in \mathbb{R}^n$ be a vector of $n$ iid standard Gaussian variables: $\xi_i \sim N(0, 1)$. Then for all $\epsilon \ge 0$:

$$\Pr\left[\,\left|\|\xi\|_2 - \sqrt{n}\right| \ge \epsilon\right] \le 3e^{-\epsilon^2/2}.$$

Proof Proposition 4.1 implies that

$$\Pr\left[\|\xi\|_2 - \sqrt{n} \ge \epsilon\right] \le \sqrt{0.5}\, e^{-\epsilon^2/2}.$$

Using an identical derivation to that in the proof of Proposition 4.1, with $\delta = \sqrt{n} - \epsilon$ and $k = n$, we obtain:



r(A:/2)2'=/2 



x<0 
2 , 



r(A;/2)2'=/2 - 

Combining the above two inequalities, we obtain the desired bound. 
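As a sanity check of the two-sided tail bound $\Pr[\,|\|\xi\|_2 - \sqrt{n}| \ge \epsilon\,] \le 3e^{-\epsilon^2/2}$, the following Monte Carlo sketch compares empirical frequencies with the bound. The sample size, number of trials, and values of $\epsilon$ are arbitrary choices made for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 20000

# Each row is one draw of xi in R^n with iid N(0, 1) entries.
xi = rng.standard_normal((trials, n))
dev = np.abs(np.linalg.norm(xi, axis=1) - np.sqrt(n))

results = {}
for eps in (1.5, 2.0, 3.0):
    empirical = float(np.mean(dev >= eps))   # observed frequency of a large deviation
    bound = 3 * np.exp(-eps**2 / 2)          # right-hand side of the lemma
    results[eps] = (empirical, bound)
    print(f"eps={eps}: empirical={empirical:.4f}, bound={bound:.4f}")
```

In such simulations the empirical frequency is typically far below the bound, which is expected since the lemma is not claimed to be tight.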



The derivation of the following estimate employs a standard proof technique (for example, see 



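Lemma C.3 Suppose $X$ is generated according to Proposition 4.3. For any fixed set $S \subset \{1, \dots, p\}$ with $|S| = k$ and $0 < \delta < 1$, we have with probability exceeding $1 - 3(1 + 8/\delta)^k e^{-n\delta^2/8}$:

$$(1 - \delta)\|\beta\|_2 \le \frac{1}{\sqrt{n}}\|X_S \beta\|_2 \le (1 + \delta)\|\beta\|_2 \qquad (4)$$

for all $\beta \in \mathbb{R}^k$.

Proof It is enough to prove the conclusion in the case of $\|\beta\|_2 = 1$. According to Lemma C.1, given $\epsilon_1 > 0$, there exists a finite set $Q = \{q_i\}$ with $|Q| \le (1 + 2/\epsilon_1)^k$ such that $\|q_i\|_2 = 1$ for all $i$, and $\min_i \|\beta - q_i\|_2 \le \epsilon_1$ for all $\|\beta\|_2 = 1$.

For each $i$, since the elements of $\xi = X_S q_i$ are iid Gaussians $N(0, 1)$, Lemma C.2 implies that $\forall \epsilon_2 > 0$:

$$\Pr\Big[\,\big|\|X_S q_i\|_2 - \sqrt{n}\big| \ge \sqrt{n}\,\epsilon_2\Big] \le 3e^{-n\epsilon_2^2/2}.$$

Taking the union bound over all $q_i \in Q$, we obtain with probability exceeding $1 - 3(1 + 2/\epsilon_1)^k e^{-n\epsilon_2^2/2}$: for all $q_i \in Q$,

$$(1 - \epsilon_2) \le \frac{1}{\sqrt{n}}\|X_S q_i\|_2 \le (1 + \epsilon_2).$$

Now, we define $\rho$ as the smallest nonnegative number such that

$$\frac{1}{\sqrt{n}}\|X_S \beta\|_2 \le (1 + \rho) \qquad (5)$$

for all $\beta \in \mathbb{R}^k$ with $\|\beta\|_2 = 1$. Since for all $\|\beta\|_2 = 1$, we can find $q_i \in Q$ such that $\|\beta - q_i\|_2 \le \epsilon_1$, we have

$$\|X_S \beta\|_2 \le \|X_S q_i\|_2 + \|X_S(\beta - q_i)\|_2 \le \sqrt{n}\,(1 + \epsilon_2 + (1 + \rho)\epsilon_1),$$

where we used (5) in the derivation. Since $\rho$ is the smallest nonnegative constant for which (5) holds, we have

$$\sqrt{n}\,(1 + \rho) \le \sqrt{n}\,(1 + \epsilon_2 + (1 + \rho)\epsilon_1),$$

which implies that

$$\rho \le (\epsilon_1 + \epsilon_2)/(1 - \epsilon_1).$$

Now we choose $\epsilon_1 = \delta/4$ and $\epsilon_2 = \delta/2$. Since $0 < \delta < 1$, it is easy to see that $\rho \le \delta$. This proves the upper bound. For the lower bound, we note that for all $\|\beta\|_2 = 1$ with $\|\beta - q_i\|_2 \le \epsilon_1$, we have

$$\|X_S \beta\|_2 \ge \|X_S q_i\|_2 - \|X_S(\beta - q_i)\|_2 \ge \sqrt{n}\,(1 - \epsilon_2 - (1 + \rho)\epsilon_1),$$

which leads to the desired result.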
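The statement can be checked on simulated data. The sketch below assumes that "generated according to Proposition 4.3" means a matrix with iid $N(0,1)$ entries, which is consistent with the step of the proof that treats $X_S q_i$ as a standard Gaussian vector; the dimensions and $\delta$ are arbitrary illustrative values. It also verifies the arithmetic behind the choice $\epsilon_1 = \delta/4$, $\epsilon_2 = \delta/2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, delta = 2000, 5, 0.5

# Assumed design: iid N(0, 1) entries for the k columns indexed by S.
X_S = rng.standard_normal((n, k))

# Over unit vectors b, the extremes of ||X_S b||_2 / sqrt(n) are the
# smallest and largest singular values of X_S / sqrt(n).
sv = np.linalg.svd(X_S, compute_uv=False) / np.sqrt(n)

# Failure probability allowed by the lemma for these parameters.
fail_bound = 3 * (1 + 8 / delta) ** k * np.exp(-n * delta**2 / 8)

# Arithmetic of the proof: eps1 = delta/4 and eps2 = delta/2.
eps1, eps2 = delta / 4, delta / 2
rho_cap = (eps1 + eps2) / (1 - eps1)        # upper bound on rho
lower = 1 - eps2 - (1 + rho_cap) * eps1     # resulting lower bound

print(sv.min(), sv.max(), fail_bound, rho_cap, lower)
```

For these values the singular values concentrate near $1 \pm \sqrt{k/n}$, well inside $[1 - \delta, 1 + \delta]$, and the printed `rho_cap` and `lower` confirm $\rho \le \delta$ and $1 - \epsilon_2 - (1+\rho)\epsilon_1 \ge 1 - \delta$.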



Proof of Proposition 4.3 



For each subset $S \subset \{1, \dots, m\}$ of groups with $|S| \le g$ and $|G_S| \le k$, we know from Lemma C.3 that for all $\beta$ such that $\mathrm{supp}(\beta) \subset G_S$:

$$(1 - \delta)\|\beta\|_2 \le \frac{1}{\sqrt{n}}\|X\beta\|_2 \le (1 + \delta)\|\beta\|_2$$

with probability exceeding $1 - 3(1 + 8/\delta)^k e^{-n\delta^2/8}$.

Since the number of such subsets $S$ can be no more than $C_m^g \le (em/g)^g$, by taking the union bound, we know that the group RIP condition fails with probability less than

$$3\,(em/g)^g\,(1 + 8/\delta)^k\,e^{-n\delta^2/8}.$$
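To get a feel for the union bound, one can multiply the subset count $(em/g)^g$ by the per-subset failure probability $3(1 + 8/\delta)^k e^{-n\delta^2/8}$ and ask how large $n$ must be for the product to fall below a target level. The sketch below does this in log scale; the function names and the target level `eta` are hypothetical, and the numbers are illustrative only.

```python
import math

def log_rip_failure_bound(n, m, g, k, delta):
    """Log of 3 * (e*m/g)**g * (1 + 8/delta)**k * exp(-n * delta**2 / 8)."""
    return (math.log(3) + g * (1 + math.log(m / g))
            + k * math.log(1 + 8 / delta) - n * delta**2 / 8)

def min_samples(m, g, k, delta, eta):
    """Smallest integer n for which the union bound is at most eta."""
    rest = math.log(3) + g * (1 + math.log(m / g)) + k * math.log(1 + 8 / delta)
    return math.ceil(8 * (rest - math.log(eta)) / delta**2)

n_req = min_samples(m=1000, g=5, k=20, delta=0.5, eta=0.01)
print(n_req, math.exp(log_rip_failure_bound(n_req, 1000, 5, 20, 0.5)))
```

The required $n$ grows linearly in $k \ln(1 + 8/\delta) + g \ln(em/g)$, so the dependence on the number of groups $m$ enters only through the $g \ln(em/g)$ term.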

D Technical Lemmas 

The following lemmas are adapted from [Hj to handle group sparsity structure. Similar techniques 
can be found in [2J. The first lemma is in [H]. The proof is included for completeness. 

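Lemma D.1 Let $A = X^\top X/n$, and let $I$ and $J$ be non-overlapping index sets in $\{1, \dots, p\}$. We have

$$\|A_{I,J}\|_2 \le \sqrt{(\rho_+(I) - \rho_-(I \cup J))(\rho_+(J) - \rho_-(I \cup J))},$$

where the matrix 2-norm is defined as $\|A_{I,J}\|_2 = \sup_{\|u\|_2 = \|v\|_2 = 1} |u^\top A_{I,J} v|$.

Proof Consider $v$ with $v_I \in \mathbb{R}^{|I|}$ and $v_J \in \mathbb{R}^{|J|}$: positive semi-definiteness implies that

$$\rho_+(I)\|v_I\|_2^2 + 2t\,v_I^\top A_{I,J} v_J + t^2 \rho_+(J)\|v_J\|_2^2$$
$$\ge v_I^\top A_{I,I} v_I + 2t\,v_I^\top A_{I,J} v_J + t^2\, v_J^\top A_{J,J} v_J$$
$$\ge \rho_-(I \cup J)\big(\|v_I\|_2^2 + t^2 \|v_J\|_2^2\big)$$

for all $t$. This implies that

$$|v_I^\top A_{I,J} v_J| \le \sqrt{(\rho_+(I) - \rho_-(I \cup J))(\rho_+(J) - \rho_-(I \cup J))}\;\|v_I\|_2 \|v_J\|_2,$$

which leads to the desired result. ■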
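The bound on the off-diagonal block can be verified numerically. In the sketch below, $\rho_+(\cdot)$ and $\rho_-(\cdot)$ are computed as the largest and smallest eigenvalues of the corresponding principal submatrix of $A = X^\top X/n$; the random design and the particular index sets are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 12
X = rng.standard_normal((n, p))
A = X.T @ X / n

I = [0, 1, 2, 3]
J = [6, 7, 8]          # disjoint from I
IJ = I + J

def rho_plus(idx):
    """Largest eigenvalue of the principal submatrix A[idx, idx]."""
    return np.linalg.eigvalsh(A[np.ix_(idx, idx)]).max()

def rho_minus(idx):
    """Smallest eigenvalue of the principal submatrix A[idx, idx]."""
    return np.linalg.eigvalsh(A[np.ix_(idx, idx)]).min()

# Spectral norm of the off-diagonal block versus the bound of the lemma.
lhs = np.linalg.norm(A[np.ix_(I, J)], 2)
rhs = np.sqrt((rho_plus(I) - rho_minus(IJ)) * (rho_plus(J) - rho_minus(IJ)))
print(lhs, rhs)
```

The bound is small exactly when the eigenvalues of the submatrices on $I$, $J$, and $I \cup J$ are close to each other, which is the regime in which the later lemmas apply it.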

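The next lemma uses the previous result to control the contribution of the non-signal part of an error vector $u$ to the product $u_G^\top A_{G,G^c} u_{G^c}$.

Lemma D.2 Given $u \in \mathbb{R}^p$ and $S \subset \{1, \dots, m\}$. Consider $\ell \ge 1$ and define

$$\lambda_-^2 = \min\Big\{ \sum_{j \in S'} \lambda_j^2 \;:\; |G_{S'}| \ge \ell \Big\}.$$

Let $S_0 \subset \{1, \dots, m\} - S$ contain indices $j$ of largest values of $\|u_{G_j}\|_2/\lambda_j$ ($j \notin S$), and satisfy the condition $\ell \le |G_{S_0}| < \ell + k_0$. Let $G = G_S \cup G_{S_0}$. Then

$$\sqrt{\sum_{j \notin S \cup S_0} \|u_{G_j}\|_2^2} \;\le\; (2\lambda_-)^{-1} \sum_{j \notin S} \lambda_j \|u_{G_j}\|_2$$

and

$$n^{-1}\,\big|u_G^\top X_G^\top X_{G^c} u_{G^c}\big| \;\le\; \lambda_-^{-1}\,\tilde\rho_+\,\|u_G\|_2 \sum_{j \notin S} \lambda_j \|u_{G_j}\|_2,$$

where $\tilde\rho_+ = \sqrt{(\rho_+(G) - \rho_-(|G| + \ell + k_0 - 1))(\rho_+(\ell + k_0 - 1) - \rho_-(|G| + \ell + k_0 - 1))}$.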

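Proof Without loss of generality, we assume that $S = \{1, \dots, g\}$, and that the indices $j > g$ are in descending order of $\|u_{G_j}\|_2/\lambda_j$. Let $S_0, S_1, \dots$ be the first, second, etc., consecutive blocks of $j > g$, such that $\ell \le |G_{S_k}| < \ell + k_0$ (except for the last $S_k$). If we let $G^k = G_{S_k}$, then:

$$\sum_{j \notin S \cup S_0} \|u_{G_j}\|_2^2 \le \Big[\sum_{j \notin S \cup S_0} \lambda_j \|u_{G_j}\|_2\Big] \max_{j \notin S \cup S_0} \|u_{G_j}\|_2/\lambda_j$$
$$\le \Big[\sum_{j \notin S \cup S_0} \lambda_j \|u_{G_j}\|_2\Big] \min_{j \in S_0} \|u_{G_j}\|_2/\lambda_j$$
$$\le \Big[\sum_{j \notin S \cup S_0} \lambda_j \|u_{G_j}\|_2\Big] \Big[\sum_{j \in S_0} \lambda_j \|u_{G_j}\|_2 \Big/ \sum_{j \in S_0} \lambda_j^2\Big]$$
$$\le \frac{\big[\sum_{j \notin S} \lambda_j \|u_{G_j}\|_2\big]^2}{4\lambda_-^2}.$$

This proves the first inequality of the lemma. Similarly, we have for each $k \ge 1$:

$$\sum_{j \in S_k} \|u_{G_j}\|_2^2 \le \Big[\sum_{j \in S_k} \lambda_j \|u_{G_j}\|_2\Big] \min_{j \in S_{k-1}} \|u_{G_j}\|_2/\lambda_j.$$

Therefore

$$\sum_{k \ge 1} \|u_{G^k}\|_2 = \sum_{k \ge 1} \sqrt{\sum_{j \in S_k} \|u_{G_j}\|_2^2}$$
$$\le \sum_{k \ge 1} \sqrt{\sum_{j \in S_k} \lambda_j \|u_{G_j}\|_2}\,\sqrt{\min_{j \in S_{k-1}} \|u_{G_j}\|_2/\lambda_j}$$
$$\le (2\lambda_-)^{-1} \sum_{k \ge 1} \Big[\sum_{j \in S_{k-1}} \lambda_j \|u_{G_j}\|_2 + \sum_{j \in S_k} \lambda_j \|u_{G_j}\|_2\Big]$$
$$\le \lambda_-^{-1} \sum_{k \ge 0} \sum_{j \in S_k} \lambda_j \|u_{G_j}\|_2 = \lambda_-^{-1} \sum_{j \notin S} \lambda_j \|u_{G_j}\|_2.$$

It follows that

$$n^{-1}\,\big|u_G^\top X_G^\top X_{G^c} u_{G^c}\big| \le n^{-1} \sum_{k \ge 1} \big|u_G^\top X_G^\top X_{G^k} u_{G^k}\big|$$
$$\le \tilde\rho_+ \|u_G\|_2 \sum_{k \ge 1} \|u_{G^k}\|_2$$
$$\le \tilde\rho_+ \lambda_-^{-1} \|u_G\|_2 \sum_{j \notin S} \lambda_j \|u_{G_j}\|_2.$$

Note that Lemma D.1 is used to bound $n^{-1}\|X_G^\top X_{G^k}\|_2$. This proves the second inequality of the lemma.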
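The first inequality of the lemma, read here as $\big(\sum_{j \notin S \cup S_0} \|u_{G_j}\|_2^2\big)^{1/2} \le (2\lambda_-)^{-1} \sum_{j \notin S} \lambda_j \|u_{G_j}\|_2$, can be checked on a toy configuration. This reading of the garbled statement is an assumption, as is the convention that $\lambda_-^2$ is minimized over all index sets with $|G_{S'}| \ge \ell$; the group sizes, weights, and norms below are arbitrary illustrative values.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
sizes = [2, 3, 2, 4, 3, 2, 3, 2]                 # group sizes k_j (m = 8 groups)
m = len(sizes)
lams = rng.uniform(0.5, 1.5, size=m)             # weights lambda_j
norms = rng.exponential(1.0, size=m)             # stand-ins for ||u_{G_j}||_2
S = [0, 1]                                       # "signal" groups
ell = 4

# lambda_-^2: brute-force minimum of sum(lambda_j^2) over index sets
# whose groups cover at least ell coordinates.
lam_minus_sq = min(
    sum(lams[j] ** 2 for j in sub)
    for r in range(1, m + 1)
    for sub in itertools.combinations(range(m), r)
    if sum(sizes[j] for j in sub) >= ell
)

# S_0: groups outside S with the largest ratios ||u_{G_j}||_2 / lambda_j,
# added one at a time until they cover at least ell coordinates.
outside = sorted((j for j in range(m) if j not in S),
                 key=lambda j: norms[j] / lams[j], reverse=True)
S0, covered = [], 0
for j in outside:
    if covered >= ell:
        break
    S0.append(j)
    covered += sizes[j]

rest = [j for j in outside if j not in S0]
lhs = np.sqrt(sum(norms[j] ** 2 for j in rest))
rhs = sum(lams[j] * norms[j] for j in outside) / (2 * np.sqrt(lam_minus_sq))
print(lhs, rhs)
```

Only the norms $\|u_{G_j}\|_2$ enter this inequality, which is why the sketch works with scalar stand-ins rather than a full vector $u$.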



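The following lemma shows that the group $L_1$-norm of the group Lasso estimator's non-signal part is small (compared to the group $L_1$-norm of the parameter estimation error in the signal part).

Lemma D.3 Let $\mathrm{supp}(\bar\beta) \subset G_S$ for some $S \subset \{1, \dots, m\}$. Assume that for all $j$:

$$\lambda_j \ge 4\rho_+(G_j)^{1/2}\,\big\|(X_{G_j}^\top X_{G_j})^{-1/2} X_{G_j}^\top \epsilon\big\|_2 / \sqrt{n}.$$

Then the group Lasso solution $\hat\beta$ satisfies:

$$\sum_{j \notin S} \lambda_j \|\hat\beta_{G_j}\|_2 \le 3 \sum_{j \in S} \lambda_j \|\hat\beta_{G_j} - \bar\beta_{G_j}\|_2.$$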
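Proof The first order condition is, for each group $j$:

$$2X_{G_j}^\top (X\hat\beta - y)/n + \lambda_j\,\hat\beta_{G_j}/\|\hat\beta_{G_j}\|_2 = 0. \qquad (6)$$

By multiplying both sides by $(\hat\beta - \bar\beta)_{G_j}^\top$ and summing over $j$, we obtain

$$0 \ge -2(\hat\beta - \bar\beta)^\top X^\top X(\hat\beta - \bar\beta)/n = -2(\hat\beta - \bar\beta)^\top X^\top \epsilon/n + \sum_{j=1}^m \lambda_j\,(\hat\beta - \bar\beta)_{G_j}^\top \hat\beta_{G_j}/\|\hat\beta_{G_j}\|_2.$$

Therefore

$$\sum_{j \notin S} \lambda_j \|\hat\beta_{G_j}\|_2 \le \sum_{j \in S} \lambda_j \|\hat\beta_{G_j} - \bar\beta_{G_j}\|_2 + 2(\hat\beta - \bar\beta)^\top X^\top \epsilon/n$$
$$\le \sum_{j \in S} \lambda_j \|\hat\beta_{G_j} - \bar\beta_{G_j}\|_2 + 0.5 \sum_{j=1}^m \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2.$$

Note that the last inequality follows from the assumption of the lemma. By simplifying the above inequality, we obtain the desired bound. ■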
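The cone-type bound that this lemma is understood to give, $\sum_{j \notin S} \lambda_j \|\hat\beta_{G_j}\|_2 \le 3 \sum_{j \in S} \lambda_j \|\hat\beta_{G_j} - \bar\beta_{G_j}\|_2$, can be observed in a small simulation. The sketch below makes several assumptions that are not from the paper: the objective is taken to be $\frac{1}{n}\|X\beta - y\|_2^2 + \sum_j \lambda_j \|\beta_{G_j}\|_2$, the solver is a plain proximal gradient iteration (a swapped-in generic method, not the authors' algorithm), and each $\lambda_j$ is set from the realized noise so that the lemma's condition holds with equality, which is possible only in a simulation.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, size = 200, 10, 4
p = m * size
groups = [np.arange(j * size, (j + 1) * size) for j in range(m)]
S = [0, 1]                                    # signal groups

X = rng.standard_normal((n, p))
beta_bar = np.zeros(p)
for j in S:
    beta_bar[groups[j]] = rng.standard_normal(size)
eps = 0.5 * rng.standard_normal(n)
y = X @ beta_bar + eps

# lambda_j = 4 rho_+(G_j)^{1/2} ||(X_G' X_G)^{-1/2} X_G' eps||_2 / sqrt(n).
lams = np.empty(m)
for j, G in enumerate(groups):
    XG = X[:, G]
    rho_plus = np.linalg.eigvalsh(XG.T @ XG / n).max()
    Qg, _ = np.linalg.qr(XG)                  # ||Qg' eps|| equals the projected noise norm
    lams[j] = 4 * np.sqrt(rho_plus) * np.linalg.norm(Qg.T @ eps) / np.sqrt(n)

# Proximal gradient for (1/n)||X b - y||^2 + sum_j lam_j ||b_{G_j}||_2.
step = n / (2 * np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of the gradient
beta = np.zeros(p)
for _ in range(3000):
    z = beta - step * 2 * X.T @ (X @ beta - y) / n
    for j, G in enumerate(groups):
        nz = np.linalg.norm(z[G])
        # Group soft-thresholding: shrink the whole block toward zero.
        z[G] *= max(0.0, 1 - step * lams[j] / nz) if nz > 0 else 0.0
    beta = z

lhs = sum(lams[j] * np.linalg.norm(beta[groups[j]]) for j in range(m) if j not in S)
rhs = 3 * sum(lams[j] * np.linalg.norm((beta - beta_bar)[groups[j]]) for j in S)
print(lhs, rhs)
```

With $\lambda_j$ this large relative to the noise, the non-signal groups are usually thresholded to exactly zero, so the left-hand side is often zero while the right-hand side reflects the shrinkage bias on the signal groups.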

The following lemma bounds parameter estimation error by combining the previous two lemmas. 



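Lemma D.4 Let $\mathrm{supp}(\bar\beta) \subset G_S$ for some $S \subset \{1, \dots, m\}$. Consider $\ell \ge 1$ and let $s = |G_S| + \ell + k_0 - 1$. Define

$$\lambda_-^2 = \min\Big\{ \sum_{j \in S'} \lambda_j^2 \;:\; |G_{S'}| \ge \ell \Big\},$$

$$\tilde\rho_+ = \sqrt{(\rho_+(s) - \rho_-(2s - |G_S|))(\rho_+(s - |G_S|) - \rho_-(2s - |G_S|))}.$$

If for all $j$:

$$\lambda_j \ge 4\rho_+(G_j)^{1/2}\,\big\|(X_{G_j}^\top X_{G_j})^{-1/2} X_{G_j}^\top \epsilon\big\|_2 / \sqrt{n},$$

and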



6^< 



3 ^3' 
A_ 



then the solution of ^ satisfies: 



1.5 



i65 



ie5 
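Proof Define $S_0$ as in Lemma D.2. Let $G = \cup_{j \in S \cup S_0} G_j$. By multiplying both sides of (6) by $(\hat\beta - \bar\beta)_G^\top$, we obtain

$$2n^{-1}(\hat\beta - \bar\beta)_G^\top X_G^\top X(\hat\beta - \bar\beta) - 2n^{-1}(\hat\beta - \bar\beta)_G^\top X_G^\top \epsilon + \sum_{j \in S \cup S_0} \lambda_j\,(\hat\beta - \bar\beta)_{G_j}^\top \hat\beta_{G_j}/\|\hat\beta_{G_j}\|_2 = 0.$$

Similar to the proof in Lemma D.3, we use the assumptions on $\lambda_j$ to obtain:

$$4n^{-1}(\hat\beta - \bar\beta)_G^\top X_G^\top X(\hat\beta - \bar\beta) + \sum_{j \in S_0} \lambda_j \|\hat\beta_{G_j}\|_2 \le 3\sum_{j \in S} \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2. \qquad (7)$$

Now, Lemma D.2 implies that

$$n^{-1}(\hat\beta - \bar\beta)_G^\top X_G^\top X(\hat\beta - \bar\beta) \ge n^{-1}(\hat\beta - \bar\beta)_G^\top X_G^\top X_G(\hat\beta - \bar\beta)_G - \tilde\rho_+ \lambda_-^{-1} \|(\hat\beta - \bar\beta)_G\|_2 \sum_{j \notin S} \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2.$$

By applying Lemma D.3 we have

$$n^{-1}(\hat\beta - \bar\beta)_G^\top X_G^\top X(\hat\beta - \bar\beta) \ge \rho_-(G)\|(\hat\beta - \bar\beta)_G\|_2^2 - 3\tilde\rho_+ \lambda_-^{-1} \|(\hat\beta - \bar\beta)_G\|_2 \sum_{j \in S} \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2$$
$$\ge \rho_-(G)\|(\hat\beta - \bar\beta)_G\|_2^2 - 3\tilde\rho_+ \lambda_-^{-1} \sqrt{\sum_{j \in S} \lambda_j^2}\;\|(\hat\beta - \bar\beta)_G\|_2^2$$
$$\ge 0.5\,\rho_-(G)\|(\hat\beta - \bar\beta)_G\|_2^2.$$

The assumption of the lemma is used to derive the last inequality. Now plug this inequality into (7), we have

$$2\rho_-(G)\|(\hat\beta - \bar\beta)_G\|_2^2 + \sum_{j \in S_0} \lambda_j \|\hat\beta_{G_j}\|_2 \le 3\sum_{j \in S} \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2.$$

This implies

$$\|(\hat\beta - \bar\beta)_G\|_2^2 \le 1.5\,\rho_-(G)^{-1} \sum_{j \in S} \lambda_j \|\hat\beta_{G_j} - \bar\beta_{G_j}\|_2 \le 1.5\,\rho_-(G)^{-1} \sqrt{\sum_{j \in S} \lambda_j^2}\;\|(\hat\beta - \bar\beta)_G\|_2,$$

and hence

$$\|(\hat\beta - \bar\beta)_G\|_2^2 \le 2.25\,\rho_-(G)^{-2} \sum_{j \in S} \lambda_j^2.$$

Now Lemma D.2 and Lemma D.3 imply that

$$\|(\hat\beta - \bar\beta)_{G^c}\|_2^2 \le 0.25\,\lambda_-^{-2} \Big[\sum_{j \notin S} \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2\Big]^2$$
$$\le 2.25\,\lambda_-^{-2} \Big[\sum_{j \in S} \lambda_j \|(\hat\beta - \bar\beta)_{G_j}\|_2\Big]^2$$
$$\le 2.25\,\lambda_-^{-2} \Big(\sum_{j \in S} \lambda_j^2\Big) \|(\hat\beta - \bar\beta)_G\|_2^2.$$

By combining the previous two displayed inequalities, we obtain the lemma.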
E Proof of Theorem 5.1

Assumption 4.1 implies that with probability larger than $1 - \eta$, uniformly for all groups $j$, we have



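It follows that with the choice of $A$, $B$, and $\lambda_j$, we have $\lambda_j \ge 4\rho_+(G_j)^{1/2}\|(X_{G_j}^\top X_{G_j})^{-1/2} X_{G_j}^\top \epsilon\|_2/\sqrt{n}$ for all $j$. Moreover, assumptions of the theorem also imply that $\tilde\rho_+ \le \rho_+(s) - \rho_-(2s)$, and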

A_ 



P-is) 



Note that we have used $\sum_{j \in S'} [A^2 k_j + B^2] \le n \sum_{j \in S'} \lambda_j^2 \le 2 \sum_{j \in S'} [A^2 k_j + B^2]$.



Therefore the conditions of Lemma D.4 are satisfied. Its conclusion implies that 



< 



< 



< 



1.5 

'-p~A^) 



1.5 

'-p~^) 



1 + ,/2{A^k + B^g)/n. 



This proves the theorem. 